{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d39a806c-02a3-4a2d-8c51-f1ab1ea79d2e",
   "metadata": {},
   "source": [
    "# LLM Reasoning\n",
    "\n",
    "This notebook compares how LLMs from different Generative AI providers perform on three examples that can show issues with LLM reasoning:\n",
    "\n",
    "* [The Reversal Curse](https://github.com/lukasberglund/reversal_curse) shows that LLMs trained on \"A is B\" fail to learn \"B is A\".\n",
    "* [How many r's in the word strawberry?](https://x.com/karpathy/status/1816637781659254908) shows \"the weirdness of LLM Tokenization\".  \n",
    "* [Which number is bigger, 9.11 or 9.9?](https://x.com/DrJimFan/status/1816521330298356181) shows that \"LLMs are alien beasts.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2e413bd-983c-42a0-9580-96fedc7b1275",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat ../.env.sample"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d843e36-7de6-4726-8a39-c5dcd3c7cc11",
   "metadata": {},
   "source": [
    "Make sure your ~/.env file (copied from the .env.sample file above) has the API keys of the LLM providers to compare set before running the cell below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c966895-1a63-4922-80b7-5a20e47f29de",
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append('../../aisuite')\n",
    "\n",
    "from dotenv import load_dotenv, find_dotenv\n",
    "\n",
    "load_dotenv(find_dotenv())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09d5c5be-1085-4252-9d5e-80b50961484b",
   "metadata": {},
   "source": [
    "## Specify LLMs to Compare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26c3d5ef-b1c9-48dd-9b89-30799fd4b698",
   "metadata": {},
   "outputs": [],
   "source": [
    "import aisuite as ai\n",
    "\n",
    "client = ai.Client()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "886a904f-fef0-4f25-b3ed-41085bf0f2dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "llms = [\n",
    "        \"anthropic:claude-3-5-sonnet-20240620\",\n",
    "        \"aws:meta.llama3-1-8b-instruct-v1:0\",\n",
    "        \"groq:llama3-8b-8192\",\n",
    "        \"groq:llama3-70b-8192\",\n",
    "        \"huggingface:mistralai/Mistral-7B-Instruct-v0.3\",\n",
    "        \"openai:gpt-3.5-turbo\",\n",
    "       ]\n",
    "\n",
    "def compare_llm(messages):\n",
    "    execution_times = []\n",
    "    responses = []\n",
    "    for llm in llms:\n",
    "        start_time = time.time()\n",
    "        response = client.chat.completions.create(model=llm, messages=messages)\n",
    "        end_time = time.time()\n",
    "        execution_time = end_time - start_time\n",
    "        responses.append(response.choices[0].message.content.strip())\n",
    "        execution_times.append(execution_time)\n",
    "        print(f\"{llm} - {execution_time:.2f} seconds: {response.choices[0].message.content.strip()}\")\n",
    "    return responses, execution_times"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c3e8aa2-4ff4-485b-93d9-4a6f22d62e67",
   "metadata": {},
   "source": [
    "## The Reversal Curse"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3c4a8ef-e23b-4d4a-8561-3e5a2a866bd1",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"Who is Tom Cruise's mother?\"},\n",
    "]\n",
    "\n",
    "responses, execution_times = compare_llm(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "769f7f42-2adb-4903-ab17-3143a5d950ce",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "def display(llms, execution_times, responses):\n",
    "    data = {\n",
    "        'Provider:Model Name': llms,\n",
    "        'Execution Time': execution_times,\n",
    "        'Model Response ': responses\n",
    "    }\n",
    "    \n",
    "    df = pd.DataFrame(data)\n",
    "    df.index = df.index + 1\n",
    "    styled_df = df.style.set_table_styles(\n",
    "        [{'selector': 'th', 'props': [('text-align', 'center')]}, \n",
    "         {'selector': 'td', 'props': [('text-align', 'center')]}]\n",
    "    ).set_properties(**{'text-align': 'center'})\n",
    "    \n",
    "    return styled_df "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2359ad5-9f0b-4bd6-9838-54df91de0fb3",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(llms, execution_times, responses)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "399f6cca-7f34-4a91-aab0-070560640033",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"Who is Mary Lee Pfeiffer's son?\"},\n",
    "]\n",
    "\n",
    "responses, execution_times = compare_llm(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eee7704d-a187-41bc-b119-c94461d0ee74",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(llms, execution_times, responses)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ada8e0fb-17f0-4781-bf6a-c23ac86922ad",
   "metadata": {},
   "source": [
    "## How many r's in the word strawberry?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e537871e-68b6-44c3-886a-d3ebe7a692c1",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"How many r's in the word strawberry?\"},\n",
    "]\n",
    "\n",
    "responses, execution_times = compare_llm(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5678e393-4967-49f1-9e0f-251471dc92b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(llms, execution_times, responses)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cae3fb5f-a173-4a33-b843-65df6d1086f9",
   "metadata": {},
   "source": [
    "## Which number is bigger?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efdf2fd6-f63a-4f9b-af15-1df25590e4fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"Which number is bigger, 9.11 or 9.9?\"},\n",
    "]\n",
    "\n",
    "responses, execution_times = compare_llm(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eaa14ed1-c83b-4c8f-bb14-d318bf0c9a60",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(llms, execution_times, responses)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "198b213a-b7bf-4cce-8c30-a8408454370b",
   "metadata": {},
   "outputs": [],
   "source": [
    "messages = [\n",
    "    {\"role\": \"user\", \"content\": \"Which number is bigger, 9.11 or 9.9? Think step by step.\"},\n",
    "]\n",
    "\n",
    "responses, execution_times = compare_llm(messages)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a3fb8fc-a7a2-47d3-9db2-792f03cc47c2",
   "metadata": {},
   "outputs": [],
   "source": [
    "display(llms, execution_times, responses)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66987d26-4245-4de1-816f-fa57475101f3",
   "metadata": {},
   "source": [
    "## Takeaways\n",
    "1. Not all LLMs are created equal - not even all Llama 3 (or 3.1) are created equal (by different providers).\n",
    "2. Ask LLM to think step by step may help improve its reasoning.\n",
    "3. The way tokenization works in LLM could lead to a lot of weirdness in LLM (see AK's awesome [video](https://www.youtube.com/watch?v=zduSFxRajkE) for a deep dive).\n",
    "4. A more comprehensive benchmark would be desired, but a quick LLM comparison like shown here can be the first step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04e13c90-3680-4f1d-8f65-768a78b7adb2",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
